Oh Geez! - a Rick and Morty text analysis
=Oh geez! - a text analysis=
With season 4 approaching in November and a promised total of 70 episodes in the pipeline, I turned to https://rickandmorty.fandom.com/wiki/Rickipedia to warm up and refresh my memory. It turned out the wiki contains information in a pretty structured way, which means only one thing - we can conduct data analysis on the contents!

==Initial thoughts==
I decided to use the following data: the plot description of each episode and the transcript of its dialogues. Even before diving into the code, I knew there would be some challenges:
* Based on reading through some articles, I assumed that the data is structured in the same way for each episode. However, there was still a chance some pages were blank, missing or simply different.
* Fandom wikis are community led and open for contributions. Each article creator produces content from her own subjective perspective, which means each text can be biased in a different way - depending on what one sees as important, moving, funny etc.

Regardless, let’s see what information can be extracted. The goals for this analysis are:
* Analyze what verbs are connected with each character.
* Compare what characters are doing vs. what they are saying.
* Bonus geez question.
* Get familiar with scraping Fandom wikis (requests, BeautifulSoup).
* Practice simple text analysis (nltk).

The notebook with code can be found on GitHub.

==What are Rick and Morty doing?==
I started by putting together the plot texts from each episode article. As they are written in third person, they contain a lot of sentences like “Rick went”, “Morty said” and so on. The goal was to extract bigrams (pairs of words) that consist of:
* The name of a character - here I looked at Rick, Morty, Jerry, Beth and Summer.
* A word classified as a verb. For part-of-speech classification I am using the nltk.pos_tag() function.

The result is as follows:

As we can see, the verbs take different forms - past, present, continuous.
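The name-plus-verb bigram extraction described above might look roughly like the sketch below. It is not the code from the notebook: the function name character_verb_bigrams and the hand-tagged sample are hypothetical, and the input is assumed to have the same shape as the output of nltk.pos_tag(), a list of (word, Penn Treebank tag) tuples.

```python
# Sketch only: pairs each character name with the verb that directly follows it.
# `tagged_tokens` is assumed to look like the output of nltk.pos_tag():
# a list of (word, Penn Treebank tag) tuples.

CHARACTERS = {"Rick", "Morty", "Jerry", "Beth", "Summer"}


def character_verb_bigrams(tagged_tokens, characters=CHARACTERS):
    """Return (character, verb) pairs for adjacent name + verb tokens."""
    pairs = []
    # Zip the sequence with itself shifted by one to walk over bigrams.
    for (word, _), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        # Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBZ, ...).
        if word in characters and next_tag.startswith("VB"):
            pairs.append((word, next_word.lower()))
    return pairs


# Hand-tagged sample (an assumption) so the sketch runs without nltk data.
sample = [
    ("Rick", "NNP"), ("says", "VBZ"), ("that", "IN"),
    ("Morty", "NNP"), ("asked", "VBD"), ("about", "IN"),
    ("the", "DT"), ("portal", "NN"), (".", "."),
]
print(character_verb_bigrams(sample))  # [('Rick', 'says'), ('Morty', 'asked')]
```

With nltk installed and its tokenizer and tagger data downloaded, the input could come from nltk.pos_tag(nltk.word_tokenize(plot_text)), where plot_text is a placeholder for the scraped plot description.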
The variety of verb forms can be handled with lemmatization, which means grouping words by their dictionary form. I used the WordNetLemmatizer available in the nltk library and got the following results:

Word clouds for each member of the Sanchez / Smith family:

[Five word cloud images, one per family member]

This is really interesting, especially for the two main characters - Rick primarily “says” while Morty often “asks”, which confirms their relationship, given that Rick leads (the majority of) their adventures. There are a lot of words that underline the adventurous aspect of their lives: “create”, “attempt”, “follow”, “show” (Rick) and “find”, “learn”, “run” (Morty). Such activities are less obvious for the rest of the Smith family: they “take”, “tell”, “call” (Beth), “go”, “believe” (Jerry) and “go”, “leave”, “appear” (Summer). Interestingly, Beth and Summer are also connected with phrases that can be interpreted as their reaction to the crazy ideas of the crazy grandpa-grandson team (“frustrate”, “is mortified”).

We are able to extract interesting insights with such a simple analysis, but it’s important not to draw bold conclusions based only on summary data. And indeed, the sample available on the wiki is limited, which becomes apparent only after looking at the absolute count of verbs per character.

Knowing the limitations, we can turn to aggregated data from all episodes and characters.

==What are the top words used in plot descriptions?==
This time I’m filtering only for nouns, as frequently used verbs such as “go” and “take” do not carry much meaning without context. I am using the same nltk.pos_tag() function as in the verb extraction above. Additionally, I filtered out the main character names and a few helper nouns that were not removed during data clean-up.

My attention is caught first by words that, in my view, should be classified as verbs.
However, double checking confirms that nltk.pos_tag() indeed tags “tell” as a noun. “Tell” can also be a noun (e.g. check https://en.wiktionary.org/wiki/tell#Noun), so it seems nltk does not always return the most expected part of speech, particularly when a word is tagged without its surrounding context. I will continue the analysis with these results, but it would be interesting to dig deeper into how to approach this issue.

It looks like the words can be categorized into two groups - one connected to home: family, Earth, people, room; and the other, the adventurous one: killing, planet, ship, attempt, alien. Notably, planet is mentioned more often than Earth. The top word is “time”, which is hard to assign to either of those two groups.

With these words we get an idea of what is happening in the episodes. The next question that comes to mind is - do the characters’ dialogues indicate the same things?

==What words are used in the context of plot descriptions?==
The top 3 words (I also filtered for nouns) used by characters in total are “back”, “time” and “family”, and each has almost double the count of any other word. Setting aside “time”, which appears again and belongs to neither the home nor the adventure group, “(going?) back” and “family” seem to be the most important things characters speak about. In the context of intergalactic adventures we could be inclined to associate “back” with going back home, but that’s an assumption.

This may come as a bit of a surprise, as the impression I got after watching the show several times was that family-related issues were more or less hidden behind the crazy adventures. Both the dialogues and the plot descriptions, however, point to family as the most recurring topic that the two sources share.

If I had to use two words to make someone guess I’m talking about the Rick and Morty show, I would use “Oh geez!”. Having access to what the characters are saying, I can’t help but ask - how many times per episode do we hear the iconic phrase? It’s usually Morty who says it, when things get really out of control, but it also happens to Rick, when he’s irritated.
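Counting the phrase per episode could be sketched as below. This is an illustration under assumptions rather than the notebook's code: the transcripts mapping, its episode labels and the sample lines are made up, and matching only the word “geez” (case-insensitive) is a simplification that would miss alternative spellings in the transcripts.

```python
import re

# Sketch only: `transcripts` is a hypothetical mapping of
# episode label -> full transcript text.
GEEZ = re.compile(r"\bgeez\b", re.IGNORECASE)


def geez_per_episode(transcripts):
    """Count occurrences of the word "geez" in each episode transcript."""
    return {episode: len(GEEZ.findall(text)) for episode, text in transcripts.items()}


# Made-up lines for illustration, not real transcript data.
transcripts = {
    "episode A": "Oh geez, Rick! Geez, I don't know about this.",
    "episode B": "Let's go home.",
}
print(geez_per_episode(transcripts))  # {'episode A': 2, 'episode B': 0}
```

Episodes with a missing transcript are best left out of the mapping rather than counted as zero, so that gaps in the wiki do not look like quiet episodes.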
It turns out that for most episodes it’s just 1 or 2 times, and for some none at all. But there are spikes for the pilot episode and the 7th episode of the third season (the season had 10 episodes, but transcripts for the last two were missing). Is this a TV trick to make the first and last impressions strong, so that the phrase is engraved deeper in the audience’s minds?

==Summary==
With very simple nltk tools, we were able to extract a few general insights about the Rick and Morty TV show:
* A small sample of the characters’ activities confirms the assumptions we have about them - Rick leads the adventures, while Morty asks a lot but also takes active roles in leaving, finding and forcing. Beth, Jerry and Summer, although often drawn into adventures, are not connected with strictly adventurous verbs.
* Based on the wiki content, adventures are rather a background for family-related topics, as “family” is the most recurring topic in both plot descriptions and characters’ dialogues.
* nltk tools (lemmatization and the pos_tag function to extract verbs and nouns) provide good enough results without customisation.
* It is enough for a catchy phrase to become iconic if you hear it 2 to 3 times per episode on average.
* Scraping wikis is easy, and the available data leaves many questions open for further analysis!

Author: Dorota Mierzwa

Category:Data analyses